Natural Language in Information Retrieval
Abstract
The time seems ripe for the two fields to meet: NLP has grown out of prototypes, and IR is having a hard time improving precision. Two examples of possible approaches are considered below. Lexware is a lexicon-based system for text analysis of Swedish, applied in an information retrieval task. NLIR is an information retrieval system that uses intensive natural language processing to provide index terms on a higher level of abstraction than stems.

1 Not Much Natural Language in Information Retrieval so Far

The problem of finding the right data in a large collection was addressed long before NLP and is still addressed without NLP. The two fields hardly meet: “There is (...) an inherent granularity mismatch between the statistical techniques used in information retrieval and the linguistic techniques used in natural language processing.” [8]. Attempts to use NLP in information retrieval gave such poor results that the title of an article describing yet another test in 2000 is meant to surprise: “Linguistic Knowledge Can Improve Information Retrieval” [1]. The tenet of SMART still seems to be generally valid in IR: “good information retrieval techniques are more powerful than linguistic knowledge” [2].

When the NLP track was introduced in TREC in the nineties, several experiments showed that language resources can actually help. The gain in recall and precision is not negligible, even if it is far from a dramatic breakthrough. For instance, adding simple collocations to the list of available terms could improve precision by 10% [2]. More advanced NLP techniques remain too expensive for large-scale applications: “the use of full-scale syntactic analysis is severely pushing the limits of practicality of an information retrieval system because of the increased demand for computing power and storage.” [6].
2 NLIR – a Natural Language Information Retrieval

NLIR and Lexware are examples of projects that pursue improvement in IR by incorporating NLP, each in a different way. The conviction behind the Natural Language Information Retrieval system (NLIR) is that “robust NLP techniques can help to derive better representation of text documents for indexing and search purposes than any simple word and string-based methods commonly used in statistical full-text retrieval.” [6]

The system is organized into a “stream model”. Each stream provides an index that represents a document in one particular aspect. Various streams have been tried and reported in TREC from 1995 on. The streams are obtained from different NLP methods, which are run in parallel on a document. The contribution of each stream is optimised when the results of all streams are merged.

All kinds of NLP methods are tested in NLIR. In TREC-5, a Head-Modifier Pairs Stream involves truly intensive natural language processing: part-of-speech tagging, stemming supported by a dictionary, sentence analysis with the Tagged Text Parser, extraction of head-modifier pairs from the parse trees, and corpus-based disambiguation of long noun phrases. The stream yields abstract index terms in which paraphrases such as 'information retrieval', 'retrieval of information', 'retrieve more information', etc. can be linked together. In TREC-7 the streams are more sophisticated still: for example, a functional dependency grammar parser is used, which allows linking yet more paraphrases, e.g. 'flowers grow wild' and 'wild flowers'.

The conclusions are positive but cautious: “(...) it became clear that exploiting the full potential of linguistic processing is harder than originally anticipated.” [7] The results also show that the effort does not actually pay off, because the complex streams turn out to be less effective than a simple Stems Stream, i.e. content words.
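The stream model described above can be illustrated with a small sketch. The source only says that each stream's contribution is optimised during merging; it does not give the merging formula. The weighted sum below, the function name `merge_streams`, the stream names, and all scores and weights are assumptions made for illustration, not NLIR's actual method.

```python
from collections import defaultdict


def merge_streams(stream_scores, stream_weights):
    """Combine per-stream document scores into one ranking.

    Assumption: a plain weighted sum of scores. NLIR's real merging
    procedure is not specified in the text above.
    """
    combined = defaultdict(float)
    for stream, scores in stream_scores.items():
        # A stream without a weight contributes nothing.
        weight = stream_weights.get(stream, 0.0)
        for doc_id, score in scores.items():
            combined[doc_id] += weight * score
    # Highest combined score first; ties broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical retrieval scores for one query, one dict per stream.
stream_scores = {
    "stems": {"d1": 0.8, "d2": 0.5, "d3": 0.1},
    "head_modifier_pairs": {"d2": 0.9, "d3": 0.4},
}

# Hypothetical weights; in practice they would be tuned on training queries.
stream_weights = {"stems": 1.0, "head_modifier_pairs": 0.5}

ranking = merge_streams(stream_scores, stream_weights)
for doc_id, score in ranking:
    print(doc_id, round(score, 2))
```

In this toy setup the document `d2` moves ahead of `d1` because the head-modifier stream supports it, which is the kind of effect the weights are meant to control. A weight of zero switches a stream off entirely.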
The approach of NLIR is a traditional statistical IR backbone with NLP support for recognizing various text items, which in turn is supposed to provide index terms on a higher level of abstraction than stems. The approach of Lexware is almost the opposite: an NLP backbone plus support from statistics in assigning weights to abstract